As per Kaggle Website:
Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
Throughout this analysis we will perform feature engineering on predictors one by one in order to try to form the most accurate prediction for house prices. By the end of this analysis the models of ridge, lasso, and cubist will be performed in order to make the final predictions. Of the three, lasso ultimately performed the best and was chosen as the final submission.
library(ggplot2)
library(plyr,include.only = "revalue")
library(dplyr)
library(caret)
library(gridExtra)
library(e1071) #naive bayes
library(corrplot)
library(Metrics)
library(earth)
library(knitr)
options(scipen=999)
train <- read.csv("train.csv",stringsAsFactors = FALSE)
test <- read.csv("test.csv",stringsAsFactors = FALSE)
# Id will not be used in this study so will be dropped from both the train and test set.
Id <- test$Id
train <- train %>%
select(-Id)
test <- test %>%
select(-Id)
print(paste("The train data set contains ", nrow(train)," rows and ",ncol(train), "variables."))
## [1] "The train data set contains 1460 rows and 80 variables."
print(paste("The test data set contains ", nrow(test)," rows and ",ncol(test), "variables."))
## [1] "The test data set contains 1459 rows and 79 variables."
test$SalePrice <- NA
trainRows <- 1:nrow(train)
combined <- rbind(train,test)
The dependent variable we are trying to predict is SalePrice: Sale prices of houses in Ames, Iowa. It exists in the train data set, but not in the test data set. It is skewed to the right with some very high trailing values. This will be addressed in the preparing for modeling section.
salePriceHisto <- ggplot(train,aes(x=SalePrice))+
geom_histogram()+
labs(title="SalePrice")
salePriceBox <- ggplot(train,aes(x=SalePrice))+
geom_boxplot()+
labs(title="SalePrice")
grid.arrange(salePriceHisto,salePriceBox,ncol=2)
There exist many variables with at least one missing value in both the train and test sets.
# train dataset
for(name in names(train)){
if(sum(is.na(train[,name]))>0){
naSum=sum(is.na(train[,name]))
print(paste(name," NA's:",naSum))
}
}
## [1] "LotFrontage NA's: 259"
## [1] "Alley NA's: 1369"
## [1] "MasVnrType NA's: 8"
## [1] "MasVnrArea NA's: 8"
## [1] "BsmtQual NA's: 37"
## [1] "BsmtCond NA's: 37"
## [1] "BsmtExposure NA's: 38"
## [1] "BsmtFinType1 NA's: 37"
## [1] "BsmtFinType2 NA's: 38"
## [1] "Electrical NA's: 1"
## [1] "FireplaceQu NA's: 690"
## [1] "GarageType NA's: 81"
## [1] "GarageYrBlt NA's: 81"
## [1] "GarageFinish NA's: 81"
## [1] "GarageQual NA's: 81"
## [1] "GarageCond NA's: 81"
## [1] "PoolQC NA's: 1453"
## [1] "Fence NA's: 1179"
## [1] "MiscFeature NA's: 1406"
# test dataset
for(name in names(test)){
if(sum(is.na(test[,name]))>0){
naSum = sum(is.na(test[,name]))
print(paste(name," Na's:",naSum))
}
}
## [1] "MSZoning Na's: 4"
## [1] "LotFrontage Na's: 227"
## [1] "Alley Na's: 1352"
## [1] "Utilities Na's: 2"
## [1] "Exterior1st Na's: 1"
## [1] "Exterior2nd Na's: 1"
## [1] "MasVnrType Na's: 16"
## [1] "MasVnrArea Na's: 15"
## [1] "BsmtQual Na's: 44"
## [1] "BsmtCond Na's: 45"
## [1] "BsmtExposure Na's: 44"
## [1] "BsmtFinType1 Na's: 42"
## [1] "BsmtFinSF1 Na's: 1"
## [1] "BsmtFinType2 Na's: 42"
## [1] "BsmtFinSF2 Na's: 1"
## [1] "BsmtUnfSF Na's: 1"
## [1] "TotalBsmtSF Na's: 1"
## [1] "BsmtFullBath Na's: 2"
## [1] "BsmtHalfBath Na's: 2"
## [1] "KitchenQual Na's: 1"
## [1] "Functional Na's: 2"
## [1] "FireplaceQu Na's: 730"
## [1] "GarageType Na's: 76"
## [1] "GarageYrBlt Na's: 78"
## [1] "GarageFinish Na's: 78"
## [1] "GarageCars Na's: 1"
## [1] "GarageArea Na's: 1"
## [1] "GarageQual Na's: 78"
## [1] "GarageCond Na's: 78"
## [1] "PoolQC Na's: 1456"
## [1] "Fence Na's: 1169"
## [1] "MiscFeature Na's: 1408"
## [1] "SaleType Na's: 1"
## [1] "SalePrice Na's: 1459"
A lot of the NA values that exist are known NA values such as NA replacing the option of none. These are known from the Kaggle website and will be fixed now.
Alley: NA’s represent no alley access.
combined[is.na(combined$Alley),"Alley"] <- "noAccess"
Basement Quality, Basement Condition, Basement Exposure, Basement Finish type 1, Basement Finish type 2: NA’s represent no basement.
for(name in c("BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2")){
combined[is.na(combined[,name]),name] <- "none"
}
Fireplace Quality: Na’s represent no fireplace.
combined[is.na(combined$FireplaceQu),"FireplaceQu"] <- "none"
Garage Type, Garage Finish, Garage Quality, Garage Condition: Na’s represent no garages.
for(name in c("GarageType","GarageFinish","GarageQual","GarageCond")){
combined[is.na(combined[,name]),name] <- "none"
}
Pool Quality: NA’s represent no pool.
combined[is.na(combined$PoolQC),"PoolQC"] <- "noPool"
Fence: NA’s represent no fence.
combined[is.na(combined$Fence),"Fence"] <- "noFence"
Miscellaneous: NA’s represent no misc. feature.
combined[is.na(combined$MiscFeature),"MiscFeature"] <- "none"
Most of the remaining NA values in a variable are very small. Sale Price can be ignored as those are the known NA values from the test data set in which we are trying to solve for.
#creating na table
naDF <- data.frame(sapply(combined,function(x) sum(is.na(x))))
naDF$variable <- row.names(naDF)
names(naDF) <- c("naCount","Variable")
naDF <- naDF %>%
select(Variable,naCount) %>%
filter(naCount>0)
#creating na plot
ggplot(naDF,aes(x=reorder(Variable,desc(naCount)),y=naCount,fill=Variable))+
geom_col(aes(),show.legend=FALSE)+
labs(title="Remaining NA Values",
subtitle = "Combined Data Set")+
theme(axis.text.x=element_text(angle=45))+
xlab("Variable")+
ylab("NA Count")
We will now do a deep dive to understand if the remaining missing values can be filled in from the data. If not, we will use imputation.
# counts of nas
combined %>%
select(BsmtFullBath,BsmtHalfBath,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF) %>%
sapply(.,function(x) sum(is.na(x)))
## BsmtFullBath BsmtHalfBath BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 2 2 1 1 1 1
# rows with two BsmtFullBath Nas values
kable(combined[is.na(combined$BsmtFullBath),c("BsmtFullBath","BsmtHalfBath","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF")])
| BsmtFullBath | BsmtHalfBath | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | |
|---|---|---|---|---|---|---|
| 2121 | NA | NA | NA | NA | NA | NA |
| 2189 | NA | NA | 0 | 0 | 0 | 0 |
For observation 2189 we can see the total basement square feet is 0, so both the basement bathroom variables will be set to 0. In observation 2121 all basement variables are NA so the assumption is going to made that this house has no basement.
# observation 2189
combined[2189,c("BsmtFullBath","BsmtHalfBath")] <- 0
combined[2121,c("BsmtFullBath","BsmtHalfBath","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF")] <- 0
Of the remaining NA values in the garage variables, GarageYrBlt (Garage year built) has a significant amount of NA values with 159, while garage area and garage cars have only 1.
combined %>% select(GarageYrBlt,GarageArea,GarageCars) %>%
sapply(.,function(x) sum(is.na(x)))
## GarageYrBlt GarageArea GarageCars
## 159 1 1
# all garage variables
garageVariables <- c("GarageArea","GarageCars","GarageCond","GarageFinish","GarageQual","GarageType","GarageYrBlt")
# row where GarageArea is na
combined[is.na(combined$GarageArea),garageVariables]
## GarageArea GarageCars GarageCond GarageFinish GarageQual GarageType
## 2577 NA NA none none none Detchd
## GarageYrBlt
## 2577 NA
All garage variables of observation 2577 are pointing to that house not having a garage except garage type. Because of this, garage area and cars will be set to 0 and garage type will be changed from detached to none.
combined[2577,c("GarageArea","GarageCars")] <- 0
combined[2577,"GarageType"] <- "none"
The NA’s in GarageYrBlt (garage year built) will be addressed next. Since it is a numeric column there is no way to represent not having a garage such as “none”. The options are to divide the variables into sections and make it into a factor column with a none section, or keep it numeric and impute missing values and possibly later on make another variable signifying no garage. The latter option of keeping it numeric will be used.
print(paste("Of the combined rows,",(round(sum(combined$GarageYrBlt==combined$YearBuilt,na.rm=TRUE)/nrow(combined),2)*100),"%", "have the garage year built equalling the year the house was built."))
## [1] "Of the combined rows, 76 % have the garage year built equalling the year the house was built."
Since garage year built is the same as the house year built 76% of the time, the year the house was built will be used as the imputation value.
combined[is.na(combined$GarageYrBlt),"GarageYrBlt"] <- combined[is.na(combined$GarageYrBlt),"YearBuilt"]
Utilities will be dropped, all but one observation are the same.
table(combined$Utilities,exclude=FALSE)
##
## AllPub NoSeWa <NA>
## 2916 1 2
combined <- combined %>% select(-Utilities)
This will be used when doing mode imputation.
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
Exterior1st and exterior2nd each have one NA value. They will be replaced with their modes.
print(paste("Exterior1st NA's:",sum(is.na(combined$Exterior1st))))
## [1] "Exterior1st NA's: 1"
print(paste("Exterior2nd NA's:",sum(is.na(combined$Exterior2nd))))
## [1] "Exterior2nd NA's: 1"
table(combined$Exterior1st,exclude=FALSE)
##
## AsbShng AsphShn BrkComm BrkFace CBlock CemntBd HdBoard ImStucc MetalSd Plywood
## 44 2 6 87 2 126 442 1 450 221
## Stone Stucco VinylSd Wd Sdng WdShing <NA>
## 2 43 1025 411 56 1
table(combined$Exterior2nd,exclude=FALSE)
##
## AsbShng AsphShn Brk Cmn BrkFace CBlock CmentBd HdBoard ImStucc MetalSd Other
## 38 4 22 47 3 126 406 15 447 1
## Plywood Stone Stucco VinylSd Wd Sdng Wd Shng <NA>
## 270 6 47 1014 391 81 1
combined[is.na(combined$Exterior1st),"Exterior1st"] <- getmode(combined$Exterior1st)
combined[is.na(combined$Exterior2nd),"Exterior2nd"] <- getmode(combined$Exterior2nd)
MasVnrType and MasVnrArea contain 23 and 24 NA values. Of the 23 MasVnrType NA’s MasVnrArea is also NA. There is one MasVnrArea observation that is NA without MasVnrType being NA.
sum(is.na(combined$MasVnrArea))
## [1] 23
sum(is.na(combined$MasVnrType))
## [1] 24
combined[is.na(combined$MasVnrArea),c("MasVnrArea","MasVnrType")]
## MasVnrArea MasVnrType
## 235 NA <NA>
## 530 NA <NA>
## 651 NA <NA>
## 937 NA <NA>
## 974 NA <NA>
## 978 NA <NA>
## 1244 NA <NA>
## 1279 NA <NA>
## 1692 NA <NA>
## 1707 NA <NA>
## 1883 NA <NA>
## 1993 NA <NA>
## 2005 NA <NA>
## 2042 NA <NA>
## 2312 NA <NA>
## 2326 NA <NA>
## 2341 NA <NA>
## 2350 NA <NA>
## 2369 NA <NA>
## 2593 NA <NA>
## 2658 NA <NA>
## 2687 NA <NA>
## 2863 NA <NA>
combined[is.na(combined$MasVnrType) & !is.na(combined$MasVnrArea),c("MasVnrArea","MasVnrType")]
## MasVnrArea MasVnrType
## 2611 198 <NA>
For MasVnrType mode imputation will be used and for MasVnrArea median imputation will be used.
#mode imputation
combined[is.na(combined$MasVnrArea),"MasVnrArea"] <- median(combined$MasVnrArea,na.rm=TRUE)
combined[is.na(combined$MasVnrType),"MasVnrType"] <- median(combined$MasVnrType,na.rm=TRUE)
KitchenQual, Functional, SaleType, MSZoning, and Electrical each have only a couple NA’s so mode imputation will be used.
#sum of na's
for(name in c("KitchenQual","Functional","SaleType","MSZoning","Electrical")){
print(name)
print(sum(is.na(combined[,name])))
}
## [1] "KitchenQual"
## [1] 1
## [1] "Functional"
## [1] 2
## [1] "SaleType"
## [1] 1
## [1] "MSZoning"
## [1] 4
## [1] "Electrical"
## [1] 1
#replacing na's with mode
for(name in c("KitchenQual","Functional","SaleType","MSZoning","Electrical")){
combined[is.na(combined[,name]),name] <- getmode(combined[,name])
}
Lot frontage has a considerable amount of NA values with 486. Instead of using median imputation, a simple linear regression model will be built with the variable with the highest correlation, which is lot area.
#finds all numeric variables
numericVariable <- combined %>%
select_if(is.numeric)
#creates dataframe of each numeric variable correlation with LotFrontage
lotFrontageCorr <- as.data.frame(sapply(numericVariable, function(x) cor(combined$LotFrontage,x,use="complete.obs")))
names(lotFrontageCorr) <- "corValues"
lotFrontageCorr %>%
arrange(corValues) %>%
tail(2)
## corValues
## LotArea 0.4898956
## LotFrontage 1.0000000
# building SLR model
slrImputation <- lm(LotFrontage~LotArea,data=combined[!is.na(combined$LotFrontage),])
#results of model
slrImputation
##
## Call:
## lm(formula = LotFrontage ~ LotArea, data = combined[!is.na(combined$LotFrontage),
## ])
##
## Coefficients:
## (Intercept) LotArea
## 51.99510 0.00178
#imputing lot frontage using lot area
combined[is.na(combined$LotFrontage),"LotFrontage"] <- predict(slrImputation,combined[is.na(combined$LotFrontage),])
The NA values are now all resolved.
### plots for numeric variables
numericggplot <- function(variable){
histo <- ggplot(combined,aes(x=combined[,variable]))+
geom_histogram()+
xlab(variable)+
theme(axis.text=element_text(size=14),
axis.title=element_text(size=14,face="bold"))+
labs(title=variable)
trainData <- combined[!is.na(combined$SalePrice),]
scattero <- ggplot(trainData,aes(x=trainData[,variable],y=SalePrice))+
geom_point()+
geom_smooth(method = "lm", se=FALSE, color="black")+
xlab(variable)+
ylab("SalePrice")+
theme(axis.text=element_text(size=14),
axis.title=element_text(size=14,face="bold"))+
labs(title=variable)
grid.arrange(histo,scattero,ncol=2)
}
discreteggplot <- function(name){
discretePlot <- ggplot(combined,aes(x=combined[,name]))+
geom_bar()+xlab(name)
trainData <- combined[!is.na(combined$SalePrice),]
discreteBox <- ggplot(trainData,aes(x=as.character(trainData[,name]),y=SalePrice))+
geom_boxplot()+xlab(name)
grid.arrange(discretePlot,discreteBox,ncol=2)
}
For numeric variables in general if it is an ordered predictor we will make it an integer and if not we will make it a double. This will make the pre-processing (scaling, centering…) much easier.
LotFrontage: Linear feet of street connected to property. There appears to be four outliers with lot frontage above 300 including observations 250, 336, 935, and 1299.
numericggplot("LotFrontage")
kable(combined[combined$LotFrontage>300,])
| MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 250 | 50 | RL | 334.9824 | 159000 | Pave | noAccess | IR2 | Low | CulDSac | Sev | ClearCr | Norm | Norm | 1Fam | 1.5Fin | 6 | 7 | 1958 | 2006 | Gable | CompShg | Wd Sdng | HdBoard | BrkCmn | 472 | Gd | TA | CBlock | Gd | TA | Gd | Rec | 697 | Unf | 0 | 747 | 1444 | GasA | Gd | Y | SBrkr | 1444 | 700 | 0 | 2144 | 0 | 1 | 2 | 0 | 4 | 1 | Gd | 7 | Typ | 2 | TA | Attchd | 1958 | Fin | 2 | 389 | TA | TA | Y | 0 | 98 | 0 | 0 | 0 | 0 | noPool | noFence | Shed | 500 | 6 | 2007 | WD | Normal | 277000 |
| 336 | 190 | RL | 345.0560 | 164660 | Grvl | noAccess | IR1 | HLS | Corner | Sev | Timber | Norm | Norm | 2fmCon | 1.5Fin | 5 | 6 | 1965 | 1965 | Gable | CompShg | Plywood | Plywood | None | 0 | TA | TA | CBlock | TA | TA | Gd | ALQ | 1249 | BLQ | 147 | 103 | 1499 | GasA | Ex | Y | SBrkr | 1619 | 167 | 0 | 1786 | 2 | 0 | 2 | 0 | 3 | 1 | TA | 7 | Typ | 2 | Gd | Attchd | 1965 | Fin | 2 | 529 | TA | TA | Y | 670 | 0 | 0 | 0 | 0 | 0 | noPool | noFence | Shed | 700 | 8 | 2008 | WD | Normal | 228950 |
| 935 | 20 | RL | 313.0000 | 27650 | Pave | noAccess | IR2 | HLS | Inside | Mod | NAmes | PosA | Norm | 1Fam | 1Story | 7 | 7 | 1960 | 2007 | Flat | Tar&Grv | Wd Sdng | Wd Sdng | None | 0 | TA | TA | CBlock | Gd | TA | Gd | GLQ | 425 | Unf | 0 | 160 | 585 | GasA | Ex | Y | SBrkr | 2069 | 0 | 0 | 2069 | 1 | 0 | 2 | 0 | 4 | 1 | Gd | 9 | Typ | 1 | Gd | Attchd | 1960 | RFn | 2 | 505 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | noPool | noFence | none | 0 | 11 | 2008 | WD | Normal | 242000 |
| 1299 | 60 | RL | 313.0000 | 63887 | Pave | noAccess | IR3 | Bnk | Corner | Gtl | Edwards | Feedr | Norm | 1Fam | 2Story | 10 | 5 | 2008 | 2008 | Hip | ClyTile | Stucco | Stucco | Stone | 796 | Ex | TA | PConc | Ex | TA | Gd | GLQ | 5644 | Unf | 0 | 466 | 6110 | GasA | Ex | Y | SBrkr | 4692 | 950 | 0 | 5642 | 2 | 0 | 2 | 1 | 3 | 1 | Ex | 12 | Typ | 3 | Gd | Attchd | 2008 | Fin | 2 | 1418 | TA | TA | Y | 214 | 292 | 0 | 0 | 0 | 480 | Gd | noFence | none | 0 | 1 | 2008 | New | Partial | 160000 |
LotArea: Lot size in square feet. Contains only four values above 100,000 including observations 250, 314, 336, and 707.
numericggplot("LotArea")
kable(combined[combined$LotArea>100000,])
| MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 250 | 50 | RL | 334.9824 | 159000 | Pave | noAccess | IR2 | Low | CulDSac | Sev | ClearCr | Norm | Norm | 1Fam | 1.5Fin | 6 | 7 | 1958 | 2006 | Gable | CompShg | Wd Sdng | HdBoard | BrkCmn | 472 | Gd | TA | CBlock | Gd | TA | Gd | Rec | 697 | Unf | 0 | 747 | 1444 | GasA | Gd | Y | SBrkr | 1444 | 700 | 0 | 2144 | 0 | 1 | 2 | 0 | 4 | 1 | Gd | 7 | Typ | 2 | TA | Attchd | 1958 | Fin | 2 | 389 | TA | TA | Y | 0 | 98 | 0 | 0 | 0 | 0 | noPool | noFence | Shed | 500 | 6 | 2007 | WD | Normal | 277000 |
| 314 | 20 | RL | 150.0000 | 215245 | Pave | noAccess | IR3 | Low | Inside | Sev | Timber | Norm | Norm | 1Fam | 1Story | 7 | 5 | 1965 | 1965 | Hip | CompShg | BrkFace | BrkFace | None | 0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 1236 | Rec | 820 | 80 | 2136 | GasW | TA | Y | SBrkr | 2036 | 0 | 0 | 2036 | 2 | 0 | 2 | 0 | 3 | 1 | TA | 8 | Typ | 2 | Gd | Attchd | 1965 | RFn | 2 | 513 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | noPool | noFence | none | 0 | 6 | 2009 | WD | Normal | 375000 |
| 336 | 190 | RL | 345.0560 | 164660 | Grvl | noAccess | IR1 | HLS | Corner | Sev | Timber | Norm | Norm | 2fmCon | 1.5Fin | 5 | 6 | 1965 | 1965 | Gable | CompShg | Plywood | Plywood | None | 0 | TA | TA | CBlock | TA | TA | Gd | ALQ | 1249 | BLQ | 147 | 103 | 1499 | GasA | Ex | Y | SBrkr | 1619 | 167 | 0 | 1786 | 2 | 0 | 2 | 0 | 3 | 1 | TA | 7 | Typ | 2 | Gd | Attchd | 1965 | Fin | 2 | 529 | TA | TA | Y | 670 | 0 | 0 | 0 | 0 | 0 | noPool | noFence | Shed | 700 | 8 | 2008 | WD | Normal | 228950 |
| 707 | 20 | RL | 256.9366 | 115149 | Pave | noAccess | IR2 | Low | CulDSac | Sev | ClearCr | Norm | Norm | 1Fam | 1Story | 7 | 5 | 1971 | 2002 | Gable | CompShg | Plywood | Plywood | Stone | 351 | TA | TA | CBlock | Gd | TA | Gd | GLQ | 1219 | Unf | 0 | 424 | 1643 | GasA | TA | Y | SBrkr | 1824 | 0 | 0 | 1824 | 1 | 0 | 2 | 0 | 2 | 1 | Gd | 5 | Typ | 2 | TA | Attchd | 1971 | Unf | 2 | 739 | TA | TA | Y | 380 | 48 | 0 | 0 | 0 | 0 | noPool | noFence | none | 0 | 6 | 2007 | WD | Normal | 302000 |
\(\textbf{Pick the next tab in order to see other variables.}\)
MiscVal: Value of miscellaneous feature. Most have 0, will likely make new predictor for whether or not a house has a miscellaneous feature.
numericggplot("MiscVal")
\(\textbf{Pick the next tab in order to see other variables.}\)
YearBuilt: Original construction date. The newer the house, the higher the house price. However, there are a couple house built before 1900 that are very expensize. Investigating we can see that these houses had a recent remodel. Possibly an added feature later for when remodel does not equal year built.
numericggplot("YearBuilt")
kable(combined[combined$SalePrice>200000 & combined$YearBuilt<1900 & !is.na(combined$SalePrice),])
| MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 186 | 75 | RM | 90 | 22950 | Pave | noAccess | IR2 | Lvl | Inside | Gtl | OldTown | Artery | Norm | 1Fam | 2.5Fin | 10 | 9 | 1892 | 1993 | Gable | WdShngl | Wd Sdng | Wd Sdng | None | 0 | Gd | Gd | BrkTil | TA | TA | Mn | Unf | 0 | Unf | 0 | 1107 | 1107 | GasA | Ex | Y | SBrkr | 1518 | 1518 | 572 | 3608 | 0 | 0 | 2 | 1 | 4 | 1 | Ex | 12 | Typ | 2 | TA | Detchd | 1993 | Unf | 3 | 840 | Ex | TA | Y | 0 | 260 | 0 | 0 | 410 | 0 | noPool | GdPrv | none | 0 | 6 | 2006 | WD | Normal | 475000 |
| 305 | 75 | RM | 87 | 18386 | Pave | noAccess | Reg | Lvl | Inside | Gtl | OldTown | Norm | Norm | 1Fam | 2.5Fin | 7 | 9 | 1880 | 2002 | Gable | CompShg | CemntBd | CmentBd | None | 0 | TA | TA | BrkTil | TA | TA | No | Unf | 0 | Unf | 0 | 1470 | 1470 | GasA | Ex | Y | SBrkr | 1675 | 1818 | 0 | 3493 | 0 | 0 | 3 | 0 | 3 | 1 | Gd | 10 | Typ | 1 | Ex | Attchd | 2003 | Unf | 3 | 870 | TA | TA | Y | 302 | 0 | 0 | 0 | 0 | 0 | noPool | noFence | none | 0 | 5 | 2008 | WD | Normal | 295000 |
| 584 | 75 | RM | 75 | 13500 | Pave | noAccess | Reg | Lvl | Inside | Gtl | OldTown | Artery | PosA | 1Fam | 2.5Unf | 10 | 9 | 1893 | 2000 | Gable | CompShg | Wd Sdng | Wd Sdng | None | 0 | Ex | Ex | BrkTil | TA | TA | No | Unf | 0 | Unf | 0 | 1237 | 1237 | GasA | Gd | Y | SBrkr | 1521 | 1254 | 0 | 2775 | 0 | 0 | 3 | 1 | 3 | 1 | Gd | 9 | Typ | 1 | Gd | Detchd | 1988 | Unf | 2 | 880 | Gd | TA | Y | 105 | 502 | 0 | 0 | 0 | 0 | noPool | noFence | none | 0 | 7 | 2008 | WD | Normal | 325000 |
| 748 | 70 | RM | 65 | 11700 | Pave | Pave | IR1 | Lvl | Corner | Gtl | OldTown | Norm | Norm | 1Fam | 2Story | 7 | 7 | 1880 | 2003 | Mansard | CompShg | Stucco | Stucco | None | 0 | Gd | TA | Stone | TA | Fa | No | Unf | 0 | Unf | 0 | 1240 | 1240 | GasW | TA | N | SBrkr | 1320 | 1320 | 0 | 2640 | 0 | 0 | 1 | 1 | 4 | 1 | Gd | 8 | Typ | 1 | Gd | Detchd | 1950 | Unf | 4 | 864 | TA | TA | N | 181 | 0 | 386 | 0 | 0 | 0 | noPool | noFence | none | 0 | 5 | 2009 | WD | Normal | 265979 |
YearRemodAdd: Remodel date (same as construction date if no remodeling or additions).
numericggplot("YearRemodAdd")
print(paste(round(sum(combined$YearRemodAdd==combined$YearBuilt)/nrow(combined)*100,0),"% of houses did not have a remodel."))
## [1] "53 % of houses did not have a remodel."
GarageYrBlt: Year garage was built. Most of the time the garage was built when the house was built. There will be an added feature later on to depict whether or not the garage was added after the house was built. There is a clear typo with garage year built after 2200. This value will be replaced with the year that the house was built.
numericggplot("GarageYrBlt")
#percentage of time garage was built when house was built
print(paste(round(sum(combined$GarageYrBlt==combined$YearBuilt)/nrow(combined)*100,0),"% of garages built were built when the house was built."))
## [1] "81 % of garages built were built when the house was built."
#comparing garage year built with house year built
ggplot(combined,aes(x=YearBuilt,y=GarageYrBlt))+
geom_point()
#removing typo
combined[combined$GarageYrBlt>2100,"GarageYrBlt"] <- combined[combined$GarageYrBlt>2100,"YearBuilt"]
\(\textbf{Pick the next tab in order to see other variables.}\)
GarageCars: Size of garage in car capacity.
GarageArea: Size of garage in square feet.
There exists a strong correlation between garage cars and garage area which makes sense. One of the predictors will likely be dropped later on.
discreteggplot("GarageCars")
numericggplot("GarageArea")
#
round(cor(combined$GarageCars,combined$GarageArea),2)
## [1] 0.89
\(\textbf{Pick the next tab in order to see other variables.}\)
MasVnrArea: Masonry veneer area in square feet. There exists a lot of zeros signifying what should be none for MasVnrType: masonry veneer type. Instead, there are three observations with a labeled masonry veneer type that is not none, but have 0 for masonry veneer area. For these observations we will set the 0’s to the median masonry veneer area of those respective masonry veneer types.
numericggplot("MasVnrArea")
table(combined$MasVnrType)
##
## BrkCmn BrkFace None Stone
## 25 879 1766 249
combined[combined$MasVnrArea==0 & combined$MasVnrType!="None",c("MasVnrArea","MasVnrType")]
## MasVnrArea MasVnrType
## 689 0 BrkFace
## 1242 0 Stone
## 2320 0 BrkFace
# replacing 0's of BrkFace with median of MasVnrArea that have BrkFace
combined[combined$MasVnrArea==0 & combined$MasVnrType=="BrkFace","MasVnrArea"] <- median(combined[combined$MasVnrType=="BrkFace","MasVnrArea"])
# replacing 0 of Stone with median of MasVnrArea that has Stone
combined[combined$MasVnrArea==0 & combined$MasVnrType=="Stone","MasVnrArea"] <- median(combined[combined$MasVnrType=="Stone","MasVnrArea"])
\(\textbf{Pick the next tab in order to see other variables.}\)
BsmtFinSF1: Type 1 finished square feet.
BsmtFinSF2: Type 2 finished square feet.
BsmtUnfSF: Unfinished square feet of basement area.
TotalBsmtSF: Total square feet of basement area. Is the sum of finished square feet 1, finished square feet 2, and unfinished square feet. There is one outlier with a total basement square feet above 6000, observation 1299.
numericggplot("BsmtFinSF1")
numericggplot("BsmtFinSF2")
numericggplot("BsmtUnfSF")
numericggplot("TotalBsmtSF")
# Sum of all FinSF1, FinSF2, and UnfSF equal total basement sf.
sum(combined$TotalBsmtSF==combined$BsmtFinSF1+combined$BsmtFinSF2+combined$BsmtUnfSF)/nrow(combined)
## [1] 1
kable(combined[combined$TotalBsmtSF>6000,])
| MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1299 | 60 | RL | 313 | 63887 | Pave | noAccess | IR3 | Bnk | Corner | Gtl | Edwards | Feedr | Norm | 1Fam | 2Story | 10 | 5 | 2008 | 2008 | Hip | ClyTile | Stucco | Stucco | Stone | 796 | Ex | TA | PConc | Ex | TA | Gd | GLQ | 5644 | Unf | 0 | 466 | 6110 | GasA | Ex | Y | SBrkr | 4692 | 950 | 0 | 5642 | 2 | 0 | 2 | 1 | 3 | 1 | Ex | 12 | Typ | 3 | Gd | Attchd | 2008 | Fin | 2 | 1418 | TA | TA | Y | 214 | 292 | 0 | 0 | 0 | 480 | Gd | noFence | none | 0 | 1 | 2008 | New | Partial | 160000 |
\(\textbf{Pick the next tab in order to see other variables.}\)
X1stFlrSF: First Floor square feet.
X2ndFlrSF: Second floor square feet.
LowQualFinSF: Low quality finished square feet (all floors).
GrLivArea: Above grade (ground) living area square feet.
First floor square feet + second floor square feet + low quality square feet = above ground living area for each observation. Observation 1299 comes up again as an outlier, will likely be dropped later on.
numericggplot("X1stFlrSF")
numericggplot("X2ndFlrSF")
numericggplot("LowQualFinSF")
numericggplot("GrLivArea")
sum(combined$X1stFlrSF+combined$X2ndFlrSF+combined$LowQualFinSF==combined$GrLivArea)/nrow(combined)
## [1] 1
kable(combined[combined$GrLivArea>5000,])
| MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1299 | 60 | RL | 313 | 63887 | Pave | noAccess | IR3 | Bnk | Corner | Gtl | Edwards | Feedr | Norm | 1Fam | 2Story | 10 | 5 | 2008 | 2008 | Hip | ClyTile | Stucco | Stucco | Stone | 796 | Ex | TA | PConc | Ex | TA | Gd | GLQ | 5644 | Unf | 0 | 466 | 6110 | GasA | Ex | Y | SBrkr | 4692 | 950 | 0 | 5642 | 2 | 0 | 2 | 1 | 3 | 1 | Ex | 12 | Typ | 3 | Gd | Attchd | 2008 | Fin | 2 | 1418 | TA | TA | Y | 214 | 292 | 0 | 0 | 0 | 480 | Gd | noFence | none | 0 | 1 | 2008 | New | Partial | 160000 |
| 2550 | 20 | RL | 128 | 39290 | Pave | noAccess | IR1 | Bnk | Inside | Gtl | Edwards | Norm | Norm | 1Fam | 1Story | 10 | 5 | 2008 | 2009 | Hip | CompShg | CemntBd | CmentBd | Stone | 1224 | Ex | TA | PConc | Ex | TA | Gd | GLQ | 4010 | Unf | 0 | 1085 | 5095 | GasA | Ex | Y | SBrkr | 5095 | 0 | 0 | 5095 | 1 | 1 | 2 | 1 | 2 | 1 | Ex | 15 | Typ | 2 | Gd | Attchd | 2008 | Fin | 3 | 1154 | TA | TA | Y | 546 | 484 | 0 | 0 | 0 | 0 | noPool | noFence | none | 17000 | 10 | 2007 | New | Partial | NA |
\(\textbf{Pick the next tab in order to see other variables.}\)
BsmtFullBath: Basement full bathrooms.
BsmtHalfBath: Basement half bathrooms.
FullBath: Full bathrooms above grade.
HalfBath: Half baths above grade.
#BsmtFullBath plot
discreteggplot("BsmtFullBath")
#BsmtHalfBath plot
discreteggplot("BsmtHalfBath")
#FullBath
discreteggplot("FullBath")
#HalfBath
discreteggplot("HalfBath")
There is no variable for the total bathrooms in a house. This will be added later on in the feature engineering section.
\(\textbf{Pick the next tab in order to see other variables.}\)
BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms).
KitchenAbvGr: Kitchens above grade.
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms).
The total rooms does not equal bedrooms + kitchens. A variable can be made later to address rooms that are not kitchens or bedrooms.
ggplot(combined[!is.na(combined$SalePrice),],aes(x=as.factor(BedroomAbvGr),y=SalePrice))+geom_boxplot()+xlab("Bedrooms Above Ground")
ggplot(combined[!is.na(combined$SalePrice),],aes(x=as.factor(KitchenAbvGr),y=SalePrice))+geom_boxplot()+xlab("Kitchens Above Ground")
ggplot(combined[!is.na(combined$SalePrice),],aes(x=as.factor(TotRmsAbvGrd),y=SalePrice))+geom_boxplot()+xlab("Total Rooms Above Ground")
\(\textbf{Pick the next tab in order to see other variables.}\)
Fireplaces: Number of fireplaces. There is only one house that contains 4 fireplaces and it occurs in the test set.
discreteggplot("Fireplaces")
\(\textbf{Pick the next tab in order to see other variables.}\)
OpenPorchSF: Open porch area in square feet.
EnclosedPorch: Enclosed porch area in square feet.
X3SsnPorch: Three season porch area in square feet.
ScreenPorch: Screen porch area in square feet.
A predictor will be made later for whether or not a house has a porch.
numericggplot("OpenPorchSF")
numericggplot("EnclosedPorch")
numericggplot("X3SsnPorch")
numericggplot("ScreenPorch")
WoodDeckSF: Wood deck area in square feet.
numericggplot("WoodDeckSF")
\(\textbf{Pick the next tab in order to see other variables.}\)
PoolArea: Pool area in square feet. When the quality of the pool is labeled as “noPool”, the pool area should be zero. This is not the case for three observations. When comparing the size of the pool to the quality there seems to be no pattern. The values of quality for these observations will be filled in with good, which is the middle value of pool quality.
numericggplot("PoolArea")
#checking if relation between size and quality
combined[combined$PoolArea!=0,c("PoolArea","PoolQC")] %>% arrange(PoolArea)
## PoolArea PoolQC
## 1 144 Ex
## 2 228 Ex
## 3 368 noPool
## 4 444 noPool
## 5 480 Gd
## 6 512 Ex
## 7 519 Fa
## 8 555 Ex
## 9 561 noPool
## 10 576 Gd
## 11 648 Fa
## 12 738 Gd
## 13 800 Gd
#setting the three pool quality values from nopool to Gd
combined[combined$PoolQC=="noPool" & combined$PoolArea!=0,c("PoolQC")] <- "Gd"
Will make pre-processing section easier.
for(name in c("LotFrontage","LotArea","YearBuilt","YearRemodAdd","MasVnrArea","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF","X1stFlrSF","X2ndFlrSF","LowQualFinSF","GrLivArea","BsmtFullBath","BsmtHalfBath","FullBath","HalfBath","BedroomAbvGr","KitchenAbvGr","TotRmsAbvGrd","Fireplaces","GarageYrBlt","GarageCars","GarageArea","WoodDeckSF","OpenPorchSF","EnclosedPorch","X3SsnPorch","ScreenPorch","PoolArea","MiscVal")){
combined[,name] <- as.double(combined[,name])
}
MoSold: Month Sold (MM).
YrSold: Year Sold (YYYY).
Most houses seem to sell in the summer months but the average house price per month sold seems pretty even. There is no clear trend upwards or downwards per month over the five years.
ggplot(combined,aes(x=as.character(MoSold)))+
geom_bar()+
scale_x_discrete(limits=seq(1,12))+
xlab("Months")
ggplot(combined[!is.na(combined$SalePrice),],aes(x=as.character(MoSold),y=SalePrice))+
geom_boxplot()+
scale_x_discrete(limits=seq(1,12))+
xlab("Months")
From the previous charts we can see the end of the range of observations is in August 2010, which explains why 2010 has the least number of houses sold. There appears to be a gradual decrease in the median price of a house sold from 2006 to 2010. Both month sold and year sold will be converted to character predictors as there is no apparent ordering of sale price within them.
discreteggplot("YrSold")
combined$YrSold <- as.character(combined$YrSold)
combined$MoSold <- as.character(combined$MoSold)
OverallQual: Rates the overall material and finish of the house
There is a clear increase in the sale of house prices as the quality of the house increases.
overallqualBar <- ggplot(combined,aes(x=OverallQual))+
geom_bar()+
scale_x_discrete(limits=seq(1,10))+
xlab("Overall Quality")
overallqualBox <- ggplot(combined[!is.na(combined$SalePrice),],aes(x=as.character(OverallQual),y=SalePrice))+
geom_boxplot()+
scale_x_discrete(limits=seq(1,10))+
xlab("Overall Quality")
grid.arrange(overallqualBar,overallqualBox,ncol=2)
\(\textbf{Pick the next tab in order to see other variables.}\)
OverallCond: Rates the overall condition of the house.
There is one outlier with overall condition 2 and another at overall condition 6 (observations 379 and 692).
ggplot(combined[!is.na(combined$SalePrice),],aes(x=as.character(OverallCond),y=SalePrice))+
geom_boxplot()+
scale_x_discrete(limits=seq(1,10))+
xlab("Overall Condition")
kable(combined[combined$OverallCond==2 & combined$SalePrice>350000 & !is.na(combined$SalePrice),])
| MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 379 | 20 | RL | 88 | 11394 | Pave | noAccess | Reg | Lvl | Corner | Gtl | StoneBr | Norm | Norm | 1Fam | 1Story | 9 | 2 | 2010 | 2010 | Hip | CompShg | VinylSd | VinylSd | Stone | 350 | Gd | TA | PConc | Ex | TA | Av | GLQ | 1445 | Unf | 0 | 411 | 1856 | GasA | Ex | Y | SBrkr | 1856 | 0 | 0 | 1856 | 1 | 0 | 1 | 1 | 1 | 1 | Ex | 8 | Typ | 1 | Ex | Attchd | 2010 | Fin | 3 | 834 | TA | TA | Y | 113 | 0 | 0 | 0 | 0 | 0 | noPool | noFence | none | 0 | 6 | 2010 | New | Partial | 394432 |
kable(combined[combined$OverallCond==6 & combined$SalePrice>600000 & !is.na(combined$SalePrice),])
| MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 692 | 60 | RL | 104 | 21535 | Pave | noAccess | IR1 | Lvl | Corner | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 10 | 6 | 1994 | 1995 | Gable | WdShngl | HdBoard | HdBoard | BrkFace | 1170 | Ex | TA | PConc | Ex | TA | Gd | GLQ | 1455 | Unf | 0 | 989 | 2444 | GasA | Ex | Y | SBrkr | 2444 | 1872 | 0 | 4316 | 0 | 1 | 3 | 1 | 4 | 1 | Ex | 10 | Typ | 2 | Ex | Attchd | 1994 | Fin | 3 | 832 | TA | TA | Y | 382 | 50 | 0 | 0 | 0 | 0 | noPool | noFence | none | 0 | 1 | 2007 | WD | Normal | 755000 |
\(\textbf{Pick the next tab in order to see other variables.}\)
ExterQual: Evaluates the quality of the material on the exterior.
ExterCond: Evaluates the present condition of the material on the exterior.
External quality and external condition will both be made into ordered predictors.
noNone <- c('Po' = 0, 'Fa' = 1, 'TA' = 2, 'Gd' = 3, 'Ex' = 4)
#making into ordered
combined$ExterQual<-as.integer(revalue(combined$ExterQual, noNone))
table(train$ExterQual)
##
## Ex Fa Gd TA
## 52 14 488 906
combined$ExterCond<-as.integer(revalue(combined$ExterCond, noNone))
discreteggplot("ExterQual")
discreteggplot("ExterCond")
\(\textbf{Pick the next tab in order to see other variables.}\)
BsmtQual: Evaluates the height of the basement.
BsmtCond: Evaluates the general condition of the basement.
Both basement quality and basement condition will be made into ordered predictors.
table(combined$BsmtCond)
##
## Fa Gd none Po TA
## 104 122 82 5 2606
withNone <-c('none' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)
combined$BsmtQual <- as.integer(revalue(combined$BsmtQual,withNone))
combined$BsmtCond <- as.integer(revalue(combined$BsmtCond,withNone))
discreteggplot("BsmtQual")
discreteggplot("BsmtCond")
\(\textbf{Pick the next tab in order to see other variables.}\)
KitchenQual: Kitchen quality.
Kitchen quality made into ordered.
combined$KitchenQual <- as.integer(revalue(combined$KitchenQual,noNone))
discreteggplot("KitchenQual")
\(\textbf{Pick the next tab in order to see other variables.}\)
FireplaceQu: Fireplace quality.
Converted to ordered.
combined$FireplaceQu <- as.integer(revalue(combined$FireplaceQu,withNone))
discreteggplot("FireplaceQu")
\(\textbf{Pick the next tab in order to see other variables.}\)
GarageQual: Garage quality.
GarageCond: Garage condition.
Converting both to ordered.
combined$GarageQual <- as.integer(revalue(combined$GarageQual,withNone))
combined$GarageCond <- as.integer(revalue(combined$GarageCond,withNone))
discreteggplot("GarageQual")
discreteggplot("GarageCond")
\(\textbf{Pick the next tab in order to see other variables.}\)
PoolQC: Pool quality.
Pool quality made into ordered. Can make feature later for whether or not a house has a pool.
poolOrdered <- c('noPool'=0,'Fa'=1,'TA'=2,'Gd'=3,'Ex'=4)
combined$PoolQC <- as.integer(revalue(combined$PoolQC,poolOrdered))
print(paste("Only ",sum(combined$PoolQC!=0)," of",nrow(combined)," houses have a pool."))
## [1] "Only 13 of 2919 houses have a pool."
discreteggplot("PoolQC")
\(\textbf{Pick the next tab in order to see other variables.}\)
Street: Type of road access to property.
Made into ordered predictor with paved being superior to gravel.
combined[combined$Street=="Grvl","Street"] <- 0
combined[combined$Street=="Pave","Street"] <- 1
combined$Street <- as.integer(combined$Street)
discreteggplot("Street")
print(paste("There are only ",sum(combined$Street==0)," houses that have a gravel steet."))
## [1] "There are only 12 houses that have a gravel steet."
\(\textbf{Pick the next tab in order to see other variables.}\)
LandSlope: Slope of property.
Made into ordered with Gtl>Mod>Sev.
#order labeling
combined[combined$LandSlope=="Gtl","LandSlope"] <- 0
combined[combined$LandSlope=="Mod","LandSlope"] <- 1
combined[combined$LandSlope=="Sev","LandSlope"] <- 2
discreteggplot("LandSlope")
\(\textbf{Pick the next tab in order to see other variables.}\)
HeatingQC: Heating quality and condition.
CentralAir: Central air conditioning.
Electrical: Electrical system.
Heating quality, central air, and electrical will all be changed into ordered.
noNone
## Po Fa TA Gd Ex
## 0 1 2 3 4
combined$HeatingQC <- as.integer(revalue(combined$HeatingQC,noNone))
combined$CentralAir <- as.integer(revalue(combined$CentralAir,c('N'=0,'Y'=1)))
combined$Electrical <- as.integer(revalue(combined$Electrical,c('Mix'=0,'FuseP'=1,'FuseF'=2,'FuseA'=3,'SBrkr'=4)))
discreteggplot("HeatingQC")
discreteggplot("CentralAir")
discreteggplot("Electrical")
\(\textbf{Pick the next tab in order to see other variables.}\)
MasVnrType: Masonry veneer type.
Made into ordered predictor with the sequential order of None, Brick Common, Brick Face, and then stone. Cinder block is not in the dataset so it wont be included, not sure where it would rank.
combined$MasVnrType <- as.integer(revalue(combined$MasVnrType,c('None'=0,'BrkCmn'=1,'BrkFace'=2,'Stone'=3)))
discreteggplot("MasVnrType")
\(\textbf{Pick the next tab in order to see other variables.}\)
BsmtExposure: Refers to walkout or garden level walls.
Made into ordered based on level of exposure.
table(combined$BsmtExposure)
##
## Av Gd Mn No none
## 418 276 239 1904 82
combined$BsmtExposure <- as.integer(revalue(combined$BsmtExposure,c('none'=0,'No'=1,'Mn'=2,'Av'=3,'Gd'=4)))
discreteggplot("BsmtExposure")
\(\textbf{Pick the next tab in order to see other variables.}\)
GarageFinish: Interior finish of the garage.
Garage finish made into ordered.
combined$GarageFinish <- as.integer(revalue(combined$GarageFinish,c('none'=0,'Unf'=1,'RFn'=2,'Fin'=3)))
discreteggplot("GarageFinish")
\(\textbf{Pick the next tab in order to see other variables.}\)
PavedDrive: Paved driveway.
combined$PavedDrive <- as.integer(revalue(combined$PavedDrive,c('N'=0,'P'=1,'Y'=2)))
discreteggplot("PavedDrive")
\(\textbf{Pick the next tab in order to see other variables.}\)
Fence: Fence quality.
It appears having no fence translates to a higher sales price than having a fence. Converting this predictor to ordered with having no fence being the most valuable.
combined$Fence <- as.integer(revalue(combined$Fence,c('MnWw'=0,'GdWo'=1,'MnPrv'=2,'GdPrv'=3,'noFence'=4)))
discreteggplot("Fence")
Will make pre-processing section easier.
for(name in c("Street","OverallQual","OverallCond","ExterQual","ExterCond","BsmtQual","BsmtCond","BsmtExposure","HeatingQC","CentralAir","KitchenQual","FireplaceQu","GarageFinish","GarageQual","GarageCond","PavedDrive","PoolQC","Fence")){
combined[,name] <- as.integer(combined[,name])
}
MSSubClass: Identifies the type of dwelling involved in the sale.
The type of dwelling is currently a numeric predictor, it should be a factor. For now it will be made into a string and factor later on. Rare labels will also be addressed later on.
str(combined$MSSubClass)
## int [1:2919] 60 20 60 70 60 50 20 60 50 190 ...
combined$MSSubClass <- as.character(combined$MSSubClass)
discreteggplot("MSSubClass")
\(\textbf{Pick the next tab in order to see other variables.}\)
MSZoning: Identifies the general zoning classification of the sale.
discreteggplot("MSZoning")
\(\textbf{Pick the next tab in order to see other variables.}\)
Alley: Type of alley access to property.
discreteggplot("Alley")
\(\textbf{Pick the next tab in order to see other variables.}\)
LotShape: General shape of property.
Could be an ordered predictor, box-plot doesn’t support this claim though so leaving as character.
discreteggplot("LotShape")
\(\textbf{Pick the next tab in order to see other variables.}\)
LandContour: Flatness of the property.
discreteggplot("LandContour")
\(\textbf{Pick the next tab in order to see other variables.}\)
LotConfig: Lot configuration.
discreteggplot("LotConfig")
\(\textbf{Pick the next tab in order to see other variables.}\)
Neighborhood: Physical locations within Ames city limits.
There exists 25 possible labels for neighborhood, some form of grouping will have to be used later on.
discreteggplot("Neighborhood")
\(\textbf{Pick the next tab in order to see other variables.}\)
Condition1: Proximity to various conditions.
Condition2: Proximity to various conditions (if more than one is present).
Most of the houses have condition1 equal to condition2, one of the columns will likely be dropped.
print(paste(round(sum((combined$Condition1==combined$Condition2)/nrow(combined)*100),2),"% of houses have condition1 = condition2"))
## [1] "86.47 % of houses have condition1 = condition2"
discreteggplot("Condition1")
discreteggplot("Condition2")
\(\textbf{Pick the next tab in order to see other variables.}\)
BldgType: Type of dwelling.
HouseStyle: Style of dwelling.
House Style has some rare labels that may require grouping.
discreteggplot("BldgType")
discreteggplot("HouseStyle")
\(\textbf{Pick the next tab in order to see other variables.}\)
RoofStyle: Type of roof.
RoofMatl: Roof material.
Roof material has four levels that only occur once. These levels will either need to be dropped or grouped.
discreteggplot("RoofStyle")
discreteggplot("RoofMatl")
table(combined$RoofMatl)
##
## ClyTile CompShg Membran Metal Roll Tar&Grv WdShake WdShngl
## 1 2876 1 1 1 23 9 7
\(\textbf{Pick the next tab in order to see other variables.}\)
Exterior1st: Exterior covering on house.
Exterior2nd: Exterior covering on house (if more than one material).
Most of the time exterior1st is the same as exterior2nd, one of the predictors will likely be dropped.
print(paste(round((sum(combined$Exterior1st==combined$Exterior2nd)/nrow(combined)*100),0),"% of the time exterior1st = exterior2nd."))
## [1] "85 % of the time exterior1st = exterior2nd."
exterior1stBar <- ggplot(combined,aes(x=Exterior1st))+
geom_bar()+
theme(axis.text.x = element_text(angle = 45))
exterior1stBox <- ggplot(combined[!is.na(combined$SalePrice),],aes(x=Exterior1st,y=SalePrice))+
geom_boxplot()+
theme(axis.text.x = element_text(angle = 45))
grid.arrange(exterior1stBar,exterior1stBox, ncol=2)
exterior2ndBar <- ggplot(combined,aes(x=Exterior2nd))+
geom_bar()+
theme(axis.text.x = element_text(angle = 45))
exterior2ndBox <- ggplot(combined[!is.na(combined$SalePrice),],aes(x=Exterior2nd,y=SalePrice))+
geom_boxplot()+
theme(axis.text.x = element_text(angle = 45))
grid.arrange(exterior2ndBar,exterior2ndBox, ncol=2)
\(\textbf{Pick the next tab in order to see other variables.}\)
Foundation: Type of foundation.
discreteggplot("Foundation")
\(\textbf{Pick the next tab in order to see other variables.}\)
BsmtFinType1: Rating of basement finished area.
BsmtFinType2: Rating of basement finished area (if multiple types).
When plotting the sale price didn’t increase as would be expected when moving from no basement to good living quarters. Because of this, these two predictors will remain character instead of being made into ordered.
print(paste(round((sum(combined$BsmtFinType1==combined$BsmtFinType2)/nrow(combined))*100,0),"% of the time BsmtFinType1=BsmtFinType2"))
## [1] "32 % of the time BsmtFinType1=BsmtFinType2"
discreteggplot("BsmtFinType1")
discreteggplot("BsmtFinType2")
\(\textbf{Pick the next tab in order to see other variables.}\)
Heating: Type of heating.
Nearly all houses have gas forced heating.
discreteggplot("Heating")
table(combined$Heating)
##
## Floor GasA GasW Grav OthW Wall
## 1 2874 27 9 2 6
\(\textbf{Pick the next tab in order to see other variables.}\)
Functional: Home functionality (Assume typical unless deductions are warranted).
discreteggplot("Functional")
\(\textbf{Pick the next tab in order to see other variables.}\)
GarageType: Garage location.
discreteggplot("GarageType")
\(\textbf{Pick the next tab in order to see other variables.}\)
MiscFeature: Miscellaneous feature not covered in other categories.
discreteggplot("MiscFeature")
print(paste(round(sum(combined$MiscFeature=="none")/nrow(combined)*100,0),"% of houses don't have a miscellaneous feature."))
## [1] "96 % of houses don't have a miscellaneous feature."
\(\textbf{Pick the next tab in order to see other variables.}\)
SaleType: Type of sale.
SaleCondition: Condition of sale.
Both are dominated by one label and have a number of rare labels.
discreteggplot("SaleType")
discreteggplot("SaleCondition")
Now that the data is complete and visualized, a random forest model will be run to find baseline results. Throughout the rest of the analysis we will apply techniques to try to beat this score.
custom_summary = function(data, lev = NULL, model = NULL) {
library(Metrics)
out = rmsle(data[, "obs"], data[, "pred"])
names(out) = c("rmsle")
out
}
ctrl <- trainControl(method="cv",
number=10,
summaryFunction = custom_summary,
allowParallel = TRUE)
set.seed(123)
rfBase <- train(SalePrice~.,data=combined[!is.na(combined$SalePrice),],
trControl=ctrl,
metric="rmsle",
maximize=FALSE,
tuneGrid=expand.grid(mtry=seq(20,40,by=5)))
saveRDS(rfBase,"rfBase.rds")
rfBase <- readRDS("rfBase.rds")
min(rfBase$results$rmsle)
## [1] 0.1403416
From the baseline random forest model we are also able to gather the variable importance in predicting sales price. The top 10 listed important variables are overall quality, above ground living area, garage cars, external quality, total basement square feet, kitchen quality, 1st floor square feet, garage area, year built, and basement quality.
varImp(rfBase)
## rf variable importance
##
## only 20 most important variables shown (out of 240)
##
## Overall
## OverallQual 100.000
## GrLivArea 63.966
## GarageCars 48.072
## ExterQual 41.407
## TotalBsmtSF 33.555
## KitchenQual 30.724
## X1stFlrSF 29.987
## GarageArea 29.869
## YearBuilt 28.402
## BsmtQual 26.117
## X2ndFlrSF 17.425
## BsmtFinSF1 16.837
## FullBath 13.471
## LotArea 11.408
## TotRmsAbvGrd 10.881
## FireplaceQu 10.153
## GarageYrBlt 9.458
## YearRemodAdd 8.702
## GarageFinish 8.260
## MasVnrArea 7.889
The top 10 correlated predictors with sales price is overall quality, above ground living area, external quality, kitchen quality, garage cars, garage area, total basement square feet, first floor square feet, basement quality, and full bathrooms. 9 of the top 10 ranked predictors in the random forest model are also in the top 10 most correlated predictors.
numericVariables <- combined %>%
select_if(is.numeric)
SalePriceCorrelations <- as.data.frame(sapply(numericVariables,function(x) cor(x,numericVariables$SalePrice,use="complete.obs")))
names(SalePriceCorrelations) <- c("correlation")
SalePriceCorrelations %>%
filter(abs(correlation)>.5) %>%
arrange(desc(correlation))
## correlation
## SalePrice 1.0000000
## OverallQual 0.7909816
## GrLivArea 0.7086245
## ExterQual 0.6826392
## KitchenQual 0.6595997
## GarageCars 0.6404092
## GarageArea 0.6234314
## TotalBsmtSF 0.6135806
## X1stFlrSF 0.6058522
## BsmtQual 0.5852072
## FullBath 0.5606638
## GarageFinish 0.5492468
## TotRmsAbvGrd 0.5337232
## YearBuilt 0.5228973
## FireplaceQu 0.5204376
## GarageYrBlt 0.5080433
## YearRemodAdd 0.5071010
Highly correlated predictors will be dropped in preparing for modeling section.
corrplot(cor(numericVariables,use="complete.obs"))
Now that our data is complete and we have visualized our data, we will begin creating our own variables in the hopes of achieving better model accuracy.
Has Remodel: A feature will be created for whether or not a house has a remodel. It will take place in any house where the remodel year does not equal the year the house was built. The houses with a remodel on average sell for slightly less, this might be because houses that had a remodel are typically over 20 years older on average.
combined <- combined %>% mutate(hasRemodel=ifelse(YearBuilt!=YearRemodAdd,1,0))
#made into character for now, factor later
combined$hasRemodel <- as.character(combined$hasRemodel)
# nearly half the houses have had a remodel
table(combined$hasRemodel)
##
## 0 1
## 1560 1359
ggplot(combined[!is.na(combined$SalePrice),],aes(x=hasRemodel,y=SalePrice))+
geom_boxplot()+xlab("Has a Remodel")
#average year built with and without remodel
combined %>% group_by(hasRemodel) %>%
summarize(avgYearBuilt=mean(YearBuilt))
## # A tibble: 2 × 2
## hasRemodel avgYearBuilt
## <chr> <dbl>
## 1 0 1983.
## 2 1 1958.
combined$hasRemodel <- as.integer(combined$hasRemodel)
\(\textbf{Pick the next tab in order to see other variables.}\)
Total Bathrooms: There are four predictors for number of bathrooms, but no predictor for total bathrooms. Total Bathrooms will be created here.
#creating total bathroom predictor
combined <- combined %>% mutate(totalBathrooms=FullBath+HalfBath+BsmtFullBath+BsmtHalfBath)
ggplot(combined[!is.na(combined$SalePrice),],aes(x=as.character(totalBathrooms),y=SalePrice))+
geom_boxplot()+ xlab("Total Bathrooms")
combined$totalBathrooms <- as.double(combined$totalBathrooms)
\(\textbf{Pick the next tab in order to see other variables.}\)
Other Rooms: There was a predictor for kitchens and a predictor for bedrooms, but there was no predictor for rooms that were not kitchens or bedrooms. We will address this now.
#not bedrooms or kitchens = totalRooms-bedrooms-kitchens
combined <- combined %>% mutate(otherRooms=TotRmsAbvGrd-KitchenAbvGr-BedroomAbvGr)
discreteggplot("otherRooms")
combined$otherRooms <- as.double(combined$otherRooms)
\(\textbf{Pick the next tab in order to see other variables.}\)
Garage was Built After House: A binary predictor will be added if a garage was built after the house was built. It appears the sale price is lower if a garage was added after the house was built even though newly added garages on average are bigger. This can again be explained because the houses with added garages are on average much older.
# creating predictor
combined$newGarage <- ifelse(combined$GarageYrBlt!=combined$YearBuilt,1,0)
# most houses that have a garage were built with house
discreteggplot("newGarage")
# mean size and year built
combined %>% group_by(newGarage) %>% summarize(mean(GarageArea),
mean(YearBuilt))
## # A tibble: 2 × 3
## newGarage `mean(GarageArea)` `mean(YearBuilt)`
## <dbl> <dbl> <dbl>
## 1 0 465. 1976.
## 2 1 507. 1949.
combined$newGarage <- as.integer(combined$newGarage)
\(\textbf{Pick the next tab in order to see other variables.}\)
Has a Porch: A predictor can be added for whether or not a house has a porch.
#if these 4 predictors = 0, then no porch
combined$hasPorch <- ifelse(combined$OpenPorchSF==0 & combined$X3SsnPorch== 0 & combined$ScreenPorch== 0 & combined$EnclosedPorch==0,0,1)
discreteggplot("hasPorch")
combined$hasPorch <- as.integer(combined$hasPorch)
\(\textbf{Pick the next tab in order to see other variables.}\)
Has Pool: Creating a variable to show if a house has a pool. Appears sale price of a house with a pool is considerably higher.
# if pool area is 0, then no pool
combined$hasPool <- ifelse(combined$PoolArea==0,0,1)
discreteggplot("hasPool")
combined$hasPool <- as.integer(combined$hasPool)
\(\textbf{Pick the next tab in order to see other variables.}\)
Has a Miscellaneous Feature: There is predictor describing which miscellaneous feature, but not a feature for whether or not a house has a miscellaneous feature in general.
table(combined$MiscFeature)
##
## Gar2 none Othr Shed TenC
## 5 2814 4 95 1
combined$hasMisc <- ifelse(combined$MiscFeature=="none",0,1)
combined$hasMisc <- as.integer(combined$hasMisc)
\(\textbf{Pick the next tab in order to see other variables.}\)
Extra Age of House Predictor: Since it is unlikely that the depreciation of a house from 2005 to 2000 would be the same as a house from 1940 to 1935, a variable is created to try to address this. Using a mars model with one pivot it appears an added year in the age of a house has less meaning for houses built before 1971.
# Building mars model to find hinge point
yearBuiltBinned <- earth(SalePrice~YearBuilt,combined[!is.na(combined$SalePrice),],nprune = 2)
#building data frame of predicted values from the mars model
predictedValues <- data.frame(YearBuilt=combined[!is.na(combined$SalePrice),"YearBuilt"],predict(yearBuiltBinned,combined[!is.na(combined$SalePrice),]))
head(predictedValues)
## YearBuilt SalePrice
## 1 2003 241708.1
## 2 1976 156793.0
## 3 2001 235418.1
## 4 1915 141067.9
## 5 2000 232273.1
## 6 1993 210258.1
#plotting what it looks like with the hinge function
predictedValues %>%
ggplot(.,aes(YearBuilt,SalePrice))+
geom_smooth(color="red",size=3)+
geom_point(data=combined[!is.na(combined$SalePrice),],aes(x=YearBuilt,y=SalePrice))+
geom_point()+
annotate("segment",x=1971,xend=1971,y=0,yend=350000,color="red")+
labs(title="Hinge at 1971")
#adding the binary variable
combined$OldHouse <- ifelse(combined$YearBuilt<=1971,1,0)
combined$OldHouse <- as.integer(combined$OldHouse)
Observation 1299 will be dropped as an outlier.
ggplot(combined[!is.na(combined$SalePrice),],aes(x=TotalBsmtSF,y=SalePrice))+
geom_point()
kable(combined[1299,])
| MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | hasRemodel | totalBathrooms | otherRooms | newGarage | hasPorch | hasPool | hasMisc | OldHouse | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1299 | 60 | RL | 313 | 63887 | 1 | noAccess | IR3 | Bnk | Corner | 0 | Edwards | Feedr | Norm | 1Fam | 2Story | 10 | 5 | 2008 | 2008 | Hip | ClyTile | Stucco | Stucco | 3 | 796 | 4 | 2 | PConc | 5 | 3 | 4 | GLQ | 5644 | Unf | 0 | 466 | 6110 | GasA | 4 | 1 | 4 | 4692 | 950 | 0 | 5642 | 2 | 0 | 2 | 1 | 3 | 1 | 4 | 12 | Typ | 3 | 4 | Attchd | 2008 | 3 | 2 | 1418 | 3 | 3 | 2 | 214 | 292 | 0 | 0 | 0 | 480 | 3 | 4 | none | 0 | 1 | 2008 | New | Partial | 160000 | 0 | 5 | 8 | 0 | 1 | 1 | 0 | 0 |
# removing obs 1299
combined <- combined[-1299,]
stringVariables <- combined %>% select_if(is.character)
makeDummy <- dummyVars(~.,data=stringVariables)
dummiedVars <- predict(makeDummy,stringVariables)
dummiedVars <- as.data.frame(dummiedVars)
Removing labels that are in under 1% of the train data. 69 dummy variables were dropped.
onePercent <- nrow(combined[!is.na(combined$SalePrice),])/100 #14.59
print(paste("Before there were",ncol(dummiedVars),"dummied variables."))
## [1] "Before there were 216 dummied variables."
# removing all predictors that are under 1%
for(i in ncol(dummiedVars):1){
#1459 is the number of train rows
if(sum(dummiedVars[1:1459,i])<onePercent){
dummiedVars <- dummiedVars[,-i]
}
}
print(paste("Now there are",ncol(dummiedVars),"dummied variables."))
## [1] "Now there are 147 dummied variables."
Box-Cox and standardization will be applied to all numeric variables that are not ordered. Since we set all ordered variables to integers, we can call all non-ordered integers very easily. Caret package will be used to achieve this.
doubleVariables <- combined %>% select_if(is.double)
kable(head(doubleVariables,3))
| LotFrontage | LotArea | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | GarageCars | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | MiscVal | totalBathrooms | otherRooms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 65 | 8450 | 2003 | 2003 | 196 | 706 | 0 | 150 | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | 2 | 0 | 61 | 0 | 0 | 0 | 0 | 4 | 4 |
| 80 | 9600 | 1976 | 1976 | 0 | 978 | 0 | 284 | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | 2 | 298 | 0 | 0 | 0 | 0 | 0 | 3 | 2 |
| 68 | 11250 | 2001 | 2002 | 162 | 486 | 0 | 434 | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | 2 | 0 | 42 | 0 | 0 | 0 | 0 | 4 | 2 |
makeCenterScale <- preProcess(doubleVariables,method=c("BoxCox","center","scale"))
centerAndScaled <- predict(makeCenterScale,doubleVariables)
kable(head(centerAndScaled,3))
| LotFrontage | LotArea | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | GarageCars | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | MiscVal | totalBathrooms | otherRooms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -0.1511310 | -0.1025818 | 1.0521221 | 0.8982123 | 0.5303965 | 0.5983871 | -0.2930296 | -0.9339259 | -0.4492400 | 1.2078793 | -0.1011972 | 0.5676963 | 1.0896616 | -0.2497213 | 0.7814475 | 1.2329198 | 0.1699279 | -0.2077 | 0.3070447 | -0.7402955 | 0.2016303 | -0.3596087 | -0.1033307 | -0.2859394 | -0.0895921 | 1.5274357 | 1.2601127 |
| 0.5160110 | 0.1482241 | 0.1483483 | -0.4005552 | -0.5681235 | 1.2090986 | -0.2930296 | -0.6291468 | 0.4924575 | -0.7845412 | -0.1011972 | -0.3698165 | -0.8188814 | 3.8211539 | 0.7814475 | -0.7558364 | 0.1699279 | -0.2077 | 0.3070447 | 1.6149019 | -0.7029414 | -0.3596087 | -0.1033307 | -0.2859394 | -0.0895921 | 0.6547072 | -0.4718745 |
| -0.0119914 | 0.4599805 | 0.9847553 | 0.8497954 | 0.3398369 | 0.1044293 | -0.2930296 | -0.2879762 | -0.3007951 | 1.2358759 | -0.1011972 | 0.7018913 | 1.0896616 | -0.2497213 | 0.7814475 | 1.2329198 | 0.1699279 | -0.2077 | 0.3070447 | -0.7402955 | -0.0801216 | -0.3596087 | -0.1033307 | -0.2859394 | -0.0895921 | 1.5274357 | -0.4718745 |
When applying the log transformation to the sale price we can see it follows a near normal distribution. Also, since the final Kaggle submission is in RMSLE we can now use carets base numerical summary which is RMSE.
woLog <- ggplot(combined[!is.na(combined$SalePrice),],aes(x=SalePrice))+
geom_histogram()
wLog <- combined[!is.na(combined$SalePrice),] %>%
mutate(logSalePrice=log(SalePrice)) %>%
ggplot(.,aes(x=logSalePrice))+
geom_histogram()+
xlab("log(SalePrice)")
grid.arrange(woLog,wLog,ncol=2)
# making saleprice the log(saleprice)
SalePrice <- log(combined$SalePrice)
combined <- combined %>% select(-SalePrice)
Here the data will be put back together. This means combining the ordered predictors, center and scaled predictors, and dummied predictors back to one data set.
#selecting the remaining terms (ordered predictors)
orderPredictors <- combined %>% select_if(is.integer)
combined <- cbind(SalePrice,orderPredictors,centerAndScaled,dummiedVars)
dim(combined)
## [1] 2918 200
Having labels that only exist in train but not in test can lead to over fitting. Having labels that only occur in test but not in train can lead to errors. Therefor, labels that exist only in train but not in test and vice versa will be removed. Found that this problem does not occur after removing values that existed in less than 1% of the train data.
#ordering so saleprice is first
combined <- combined %>% select(SalePrice,everything())
#creating train and test sets
train <- combined[!is.na(combined$SalePrice),]
test <- combined[is.na(combined$SalePrice),]
#if all 0's for train data portion removing that predictor from combined
counter <- 0
for(i in ncol(train):2){
if(sum(train[,i])==0){
counter=counter+1
print(names(train)[i])
combined <- combined[,-i]
}
if(i==2){print(paste("For the train data:", counter," variables were dropped"))}
}
## [1] "For the train data: 0 variables were dropped"
#if all 0's for the test data portion removing that predictor from combined
counter <- 0
for(i in ncol(test):2){
if(sum(test[,i])==0){
print(names(test)[i])
combined <- combined[,-i]
counter=counter+1
}
if(i==2){print(paste("For the test data:", counter," variables were dropped"))}
}
## [1] "For the test data: 0 variables were dropped"
The data is now pre-processed and in the format I wanted. Now, a train-test split can be performed before modeling.
train <- combined[!is.na(combined$SalePrice),]
test <- combined[is.na(combined$SalePrice),]
Throughout this modeling phase we will perform ridge, lasso, and cubist models.
#10 fold cross validation
ctrl <- trainControl(method="cv",
number=10,
verbose=FALSE)
The lasso model is good at conquering multicollinearity which is likely a problem in this data set. It achieved a train RMSLE of .1179.
set.seed(432)
lassoMod <- train(SalePrice~.,data=train,
method="glmnet",
trControl=ctrl,
tuneGrid=expand.grid(alpha=1,lambda=seq(0.001,.1,by=.001)))
saveRDS(lassoMod,"lassoMod.rds")
lassoMod <- readRDS("lassoMod.rds")
min(lassoMod$results$RMSE)
## [1] 0.1179436
The ridge model did not perform as well as the lasso model.
set.seed(223)
ridgeMod <- train(SalePrice~.,data=train,
method="glmnet",
trControl=ctrl,
tuneGrid=expand.grid(alpha=0,lambda=seq(.001,.1,by=.001)))
saveRDS(ridgeMod,"ridgeMod.rds")
ridgeMod <- readRDS("ridgeMod.rds")
min(ridgeMod$results$RMSE)
## [1] 0.1211724
The cubist model appears to perform ever so slightly better than the lasso model, but is much more complex.
set.seed(1234)
cubistMod <- train(SalePrice~.,data=train,
method="cubist",
trControl=ctrl,
tuneGrid=expand.grid(committees=seq(70,100,by=5),neighbors=c(7,8,9)))
saveRDS(cubistMod,"cubistMod.rds")
cubistMod <- readRDS("cubistMod.rds")
min(cubistMod$results$RMSE)
## [1] 0.1174491
modelResults <- resamples(list(lasso=lassoMod,
ridge=ridgeMod,
cubist=cubistMod))
bwplot(modelResults,metric="RMSE")
I submitted both the cubist and lasso model predictions. The lasso model slightly outperformed the cubist model and achieved a RMSLE of .12273.
#exp to transform back to original units
outputDF <- data.frame(ID=Id,SalePrice=exp(predict(lassoMod,test)))
write.csv(x = outputDF,file = "submission717lasso.csv",row.names=FALSE)